-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add support for DataFrame.from_dict
\to_dict
and Series.to_dict
#12048
Conversation
Codecov ReportBase: 87.47% // Head: 88.10% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #12048 +/- ##
================================================
+ Coverage 87.47% 88.10% +0.62%
================================================
Files 133 135 +2
Lines 21826 22133 +307
================================================
+ Hits 19093 19500 +407
+ Misses 2733 2633 -100
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a lot more logic than I had anticipated. Out of curiosity, I copied this implementation into my current branch, then added a second method from_dict2
with an identical signature that is just defined as:
return cls.from_pandas(pd.DataFrame.from_dict(data, orient, dtype, columns))
Here are some quick benchmark results:
In [9]: d = {f'i': np.random.rand(1000) for i in range(20)}
In [10]: %timeit cudf.DataFrame.from_dict2(d)
279 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [11]: %timeit cudf.DataFrame.from_dict(d)
315 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [12]: d = {f'i': np.random.rand(10) for i in range(20)}
In [13]: %timeit cudf.DataFrame.from_dict2(d)
277 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [14]: %timeit cudf.DataFrame.from_dict(d)
307 µs ± 253 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [15]: d = {f'i': np.random.rand(10000) for i in range(50)}
In [16]: %timeit cudf.DataFrame.from_dict2(d)
310 µs ± 322 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [17]: %timeit cudf.DataFrame.from_dict(d)
316 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Is there any benefit to the current, much more complex approach rather than just using the pandas fallback in this case? The complexity of our constructor is high enough that it seems like bypassing it using from_pandas
recoups more than enough performance, and this way we really don't have to think about the semantics much at all.
So, yes this is for sure quicker for CPU-backed data..I had to vendor the entire logic from pandas and do a special handling to support In [1]: import cudf
In [2]: def v2(data):
...: return cudf.from_pandas(pd.DataFrame.from_dict(data))
...:
In [3]: import pandas as pd
In [4]: data = {"a": cudf.Series([1, 2, 3, 4]), "B": cudf.Series([10, 11, 12, 13])}
In [5]: v2(data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [5], line 1
----> 1 v2(data)
Cell In [2], line 2, in v2(data)
1 def v2(data):
----> 2 return cudf.from_pandas(pd.DataFrame.from_dict(data))
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/frame.py:1762, in DataFrame.from_dict(cls, data, orient, dtype, columns)
1756 raise ValueError(
1757 f"Expected 'index', 'columns' or 'tight' for orient parameter. "
1758 f"Got '{orient}' instead"
1759 )
1761 if orient != "tight":
-> 1762 return cls(data, index=index, columns=columns, dtype=dtype)
1763 else:
1764 realdata = data["data"]
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/frame.py:662, in DataFrame.__init__(self, data, index, columns, dtype, copy)
656 mgr = self._init_mgr(
657 data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
658 )
660 elif isinstance(data, dict):
661 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 662 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
663 elif isinstance(data, ma.MaskedArray):
664 import numpy.ma.mrecords as mrecords
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
489 else:
490 # dtype check to exclude e.g. range objects, scalars
491 arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:123, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
120 index = ensure_index(index)
122 # don't force copy because getting jammed in an ndarray anyway
--> 123 arrays = _homogenize(arrays, index, dtype)
124 # _homogenize ensures
125 # - all(len(x) == len(index) for x in arrays)
126 # - all(x.ndim == 1 for x in arrays)
(...)
129
130 else:
131 index = ensure_index(index)
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:617, in _homogenize(data, index, dtype)
614 val = dict(val)
615 val = lib.fast_multiget(val, oindex._values, default=np.nan)
--> 617 val = sanitize_array(
618 val, index, dtype=dtype, copy=False, raise_cast_failure=False
619 )
620 com.require_length_match(val, index)
622 homogenized.append(val)
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/construction.py:616, in sanitize_array(data, index, dtype, copy, raise_cast_failure, allow_2d)
613 # materialize e.g. generators, convert e.g. tuples, abc.ValueView
614 if hasattr(data, "__array__"):
615 # e.g. dask array GH#38645
--> 616 data = np.array(data, copy=copy)
617 else:
618 data = list(data)
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/frame.py:447, in Frame.__array__(self, dtype)
446 def __array__(self, dtype=None):
--> 447 raise TypeError(
448 "Implicit conversion to a host NumPy array via __array__ is not "
449 "allowed, To explicitly construct a GPU matrix, consider using "
450 ".to_cupy()\nTo explicitly construct a host matrix, consider "
451 "using .to_numpy()."
452 )
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy(). |
@@ -474,10 +474,9 @@ def from_dict(data, npartitions, orient="columns", **kwargs): | |||
|
|||
if orient != "columns": | |||
raise ValueError(f"orient={orient} is not supported") | |||
# TODO: Use cudf.from_dict | |||
# (See: https://github.com/rapidsai/cudf/issues/11934) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @rjzamora for review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if 1) we could leverage the pandas method directly in cases where we know that it would be faster, and 2) if we didn't have to handle so many edge cases before devolving to the constructor, but I don't see how to do (1) without doing introspection that will slow down other code paths and I think (2) seems necessary. If other reviewers have more ideas I would welcome them, but I think I'm good with this code as is.
Co-authored-by: Vyas Ramasubramani <[email protected]>
How about this rewrite: @staticmethod
def __create_index(indexlist, namelist, library):
try:
name, = namelist
return library.Index(indexlist, name=name)
except ValueError:
return library.MultiIndex.from_tuples(
indexlist, names=namelist
)
@classmethod
def from_dict(cls, data, orient="columns", dtype=None, index=None, columns=None):
if orient == "index":
if len(data) > 0 and isinstance(next(iter(data.values())), cudf.Series):
data = cls(data).T
return cls(data, index=None, columns=columns, dtype=dtype)
else:
return cls.from_pandas(
pd.DataFrame.from_dict(
data=data,
orient=orient,
dtype=dtype,
index=index,
columns=columns,
)
)
elif orient == "columns":
if columns is not None:
raise ValueError("Cannot use columns parameter with orient='columns'")
return cls(data, index=index, columns=None, dtype=dtype)
elif orient == "tight":
if columns is not None:
raise ValueError("Cannot use columns parameter with orient='right'")
index = cls.__create_index(data["index"], data["index_names"], cudf)
columns = cls.__create_index(data["columns"], data["column_names"], pd)
return cls(data["data"], index=index, columns=columns, dtype=dtype)
else:
raise ValueError(f"Expected 'index', 'columns', or 'tight' got {orient=}") Which (personally) I find a bit clearer. It simplifies the dict transpose in the case where the values are not cudf.Series by just deferring to pandas. This may, however, be bad if the values are provided as cupy arrays (I think). Other changes (might be worthwhile anyway): handle each One could imagine doing some introspection of the data in each of the cases, but it's probably not worth it. I would hope that |
Think this is a use case we should prepare for (and ideally test against) |
@wence- @jakirkham Addressed your reviews. Could you please re-review/approve when you get a chance? |
Also cc: @vyasr since the code has changed. |
@gpucibot merge |
Description
Resolves: #11934
DataFrame.from_dict
andDataFrame.to_dict
Series.to_dict
Checklist